
[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models#37853

Merged
orozery merged 5 commits into vllm-project:main from orozery:kv-offload-register-hybrid-kv-caches
Mar 27, 2026

Conversation

@orozery
Collaborator

@orozery orozery commented Mar 23, 2026

This PR extends the offloading connector register_kv_caches function to support KV caches used in hybrid models.

We define a new CanonicalKVCaches class which captures:

  1. The unique set of KV cache tensors (tensors may be shared by multiple layers).
  2. A mapping from each group to its relevant KV cache data (given by a tensor pointer + page size). The canonical tensors each have dtype int8 and shape (num_blocks, page_size).

This PR also splits the offloading connector unit tests into multiple files.
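The canonical shape described above can be made concrete with a small sketch. This is an illustration only: `page_size_bytes` and its parameters are hypothetical names introduced here, not vLLM's actual API; the arithmetic simply shows what the second dimension of a canonical int8 tensor would be for a plain full-attention layer.

```python
# Hypothetical illustration (names are assumptions, not vLLM's API):
# the per-block page size in bytes of a full-attention layer's KV cache,
# which becomes the second dimension of the canonical int8 tensor
# of shape (num_blocks, page_size).

def page_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                    dtype_size: int, kv_factor: int = 2) -> int:
    """Bytes occupied by one block of a layer's KV cache.

    kv_factor is 2 for separate K and V tensors (e.g. flash-attention)
    and 1 for layouts that store a single tensor per block (e.g. MLA).
    """
    return kv_factor * block_size * num_kv_heads * head_size * dtype_size

# e.g. fp16 (2 bytes), 16-token blocks, 8 KV heads, head size 128:
size = page_size_bytes(block_size=16, num_kv_heads=8, head_size=128,
                       dtype_size=2)
print(size)  # 65536 bytes per block
```

Reinterpreting each layer's typed tensor as raw int8 bytes of this width is what lets one canonical representation cover differently shaped caches.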

@mergify mergify Bot added the v1 label Mar 23, 2026
@mergify
Contributor

mergify Bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and well-executed refactoring to support KV cache offloading for hybrid models. The introduction of CanonicalKVCaches provides a clean abstraction over different KV cache layouts, improving modularity and simplifying the transfer handler logic. The accompanying test refactoring is also a good improvement.

I have one high-severity comment regarding an inconsistency in the data type of the canonicalized tensor, which could lead to maintenance issues in the future. Addressing this would make the new abstraction more robust and easier to reason about.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
@orozery orozery force-pushed the kv-offload-register-hybrid-kv-caches branch from 4b710fe to cc8dc31 Compare March 23, 2026 06:04
@mergify mergify Bot removed the needs-rebase label Mar 23, 2026
@mergify
Contributor

mergify Bot commented Mar 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 24, 2026
@orozery orozery force-pushed the kv-offload-register-hybrid-kv-caches branch from cc8dc31 to 943e72a Compare March 24, 2026 09:01
@mergify mergify Bot removed the needs-rebase label Mar 24, 2026
Collaborator

@NickLucche NickLucche left a comment


hey @orozery, I left only a few comments here, mainly because I can understand what you're doing, but I'm afraid I don't fully understand why you're doing it.

Could you elaborate on why we need a CanonicalKVCacheTensor here, and how that abstraction makes life easier for you in code/transfer?

Comment on lines +113 to +119
test_shape = attn_backends[layer_name].get_kv_cache_shape(
num_blocks=1234,
block_size=16,
num_kv_heads=1,
head_size=256,
)
num_blocks_logical_dim = test_shape.index(1234)
Collaborator


there's a get_kv_cache_block_dim API now we can use

Collaborator Author


Actually I see that I use test_shape a bit below for asserting (2, ...) for flash attention.

Comment on lines +196 to +203
page_size_bytes[layer_name] = layer_kv_cache_spec.page_size_bytes
unpadded_page_size_bytes[layer_name] = replace(
layer_kv_cache_spec, page_size_padded=None
).page_size_bytes

else:
raise NotImplementedError

Collaborator


is there any way we can make this if-elif-else simpler, e.g. by factoring out the common assignments at the end?

Collaborator Author


All cases need to assign to tensors_per_block, page_size_bytes and unpadded_page_size_bytes.
To move the assignments out, I would need to introduce a local variable for each of these three dictionaries.
I don't see that simplifying things.
But maybe I missed your point...

@orozery
Collaborator Author

orozery commented Mar 24, 2026

Could you elaborate on why we would need a CanonicalKVCacheTensor here and how is that abstraction making life easier for you in code/transfer?

The offloading connector supports pluggable backends (e.g. a CPU backend, and a file-system backend in the future).
We want the backend interface to be as simple as possible, and to solve complexities once for all backends, instead of letting each backend handle them by itself.

One such complexity is registering the GPU KV caches.
The backend needs to deal with things like split-K/V layout (flash-attention), MLA layout, mamba packed state, kernel block size, and more.

To avoid all of these complexities, we define this class:

class CanonicalKVCaches:
    """
    Canonicalized block-level representation of the KV caches.

    Composed of:
        - Unique list of KV cache data tensors,
          each with shape (num_blocks, page_size_in_bytes) and int8 dtype.
        - Per-group data references of the tensors.
          i.e. how each KV cache group maps to the tensors.
    """

This allows the backend to easily use the KV caches, without having to deal with all of the above complexities.

Before this PR, we handled SOME (but not all, e.g. mamba) of these complexities inside the CPU backend (cpu_gpu.py).
You can see that as a result of this PR, many lines of code are removed from cpu_gpu.py, as it is now given CanonicalKVCaches instead of the previous kv_caches: dict[str, torch.Tensor].

So basically, the offloading connector takes responsibility for "translating" the complex-layout kv_caches into a simple, canonical, easy to work with CanonicalKVCaches.
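As a rough sketch of the idea described above (the field names and the CanonicalTensorRef helper are assumptions for illustration; the real vLLM class differs, and the real canonical tensors are int8 torch tensors rather than the bytearrays stubbed in here):

```python
# Minimal sketch of the CanonicalKVCaches concept. Only the class name
# CanonicalKVCaches comes from the PR; everything else is a stand-in.
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalTensorRef:
    """Points one KV cache group at a slice of one canonical tensor."""
    tensor_idx: int       # index into CanonicalKVCaches.tensors
    page_size_bytes: int  # bytes of this group's data per block

@dataclass
class CanonicalKVCaches:
    # Unique canonical tensors, each conceptually int8 with shape
    # (num_blocks, page_size_in_bytes); stubbed here as raw byte buffers.
    tensors: list[bytearray]
    # Per-group references: group_refs[g] tells group g which tensor
    # holds its data and how many bytes each block occupies.
    group_refs: list[CanonicalTensorRef]

# Two groups sharing one buffer (tensors may be shared by multiple layers):
buf = bytearray(4 * 1024)  # 4 blocks of 1024 bytes each
caches = CanonicalKVCaches(
    tensors=[buf],
    group_refs=[CanonicalTensorRef(0, 1024), CanonicalTensorRef(0, 1024)],
)
print(len(caches.tensors), caches.group_refs[0].page_size_bytes)  # 1 1024
```

A backend that receives such an object only iterates over uniform (num_blocks, page_size) byte buffers, regardless of the attention layout that produced them.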

Collaborator

@NickLucche NickLucche left a comment


All cases need to assign to tensors_per_block, page_size_bytes and unpadded_page_size_bytes.

Yes, that's the pattern I would like to factor out, possibly the canonical torch.tensor/raw creation in particular, which is quite verbose.

Anyway, I am unblocking since this is minor; it could maybe simply be wrapped in a util or a constructor method on CanonicalKVCaches, in order to keep the core logic in the worker file as lean as possible.

I'll leave it to you to shape as you see fit for best maintainability.

@NickLucche NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026
@NickLucche NickLucche closed this Mar 25, 2026
@NickLucche
Collaborator

NickLucche commented Mar 25, 2026

@orozery let's check CI

sorry misclicked on closing somehow >.<

@NickLucche NickLucche reopened this Mar 25, 2026
@orozery orozery force-pushed the kv-offload-register-hybrid-kv-caches branch from 89f5cca to 236714e Compare March 26, 2026 05:07
@orozery
Collaborator Author

orozery commented Mar 26, 2026

@NickLucche there was an issue with unit tests (incl. the nixl connector's) that were using set_kv_cache_layout and were affecting each other, because get_kv_cache_layout uses @functools.lru_cache.
I fixed it here for both of our tests.
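The interference pattern described here is a generic @functools.lru_cache pitfall: once the first test populates the cache, later tests that change the underlying setting still see the stale value. A minimal sketch, with stand-in definitions for get_kv_cache_layout/set_kv_cache_layout (only the function names match vLLM; the bodies are assumptions for illustration):

```python
import functools

_layout = "NHD"  # stand-in for the env/config the real function reads

@functools.lru_cache
def get_kv_cache_layout() -> str:
    return _layout

def set_kv_cache_layout(layout: str) -> None:
    global _layout
    _layout = layout

first = get_kv_cache_layout()      # "NHD", now cached
set_kv_cache_layout("HND")
stale = get_kv_cache_layout()      # still "NHD": lru_cache ignores the change
get_kv_cache_layout.cache_clear()  # the usual fix: clear between tests,
fresh = get_kv_cache_layout()      # e.g. in a fixture -> now "HND"
print(first, stale, fresh)
```

Calling `cache_clear()` in test setup/teardown is the standard way to isolate tests that share such a cached, environment-dependent function.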

@orozery orozery force-pushed the kv-offload-register-hybrid-kv-caches branch from 236714e to 112c22b Compare March 26, 2026 07:08
@mergify
Contributor

mergify Bot commented Mar 26, 2026

Hi @orozery, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

This commit extends the offloading connector register_kv_caches function
to support KV caches used in hybrid models.
We define a new CanonicalKVCaches class which captures:
1. The unique set of KV cache tensors (tensors may be shared by multiple layers)
2. Mapping each group to its relevant KV cache data (given by a tensor pointer + page size).
The canonical tensors each have dtype int8 and shape (num_blocks, page_size).
This commit also splits the offloading connector unit tests into multiple files.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the kv-offload-register-hybrid-kv-caches branch from 112c22b to 53eca8a Compare March 26, 2026 09:49
@rarepepi

excited for this! I think it's very needed for my deepseekv3.2 nvfp4 setup @orozery ty 🙏

@dannyboycrypt0

I believe merging this PR would solve one of the issues we're working on right now as well.

@orozery orozery merged commit 7cc302d into vllm-project:main Mar 27, 2026
69 checks passed
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…llm-project#37853)

Signed-off-by: Or Ozeri <oro@il.ibm.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…llm-project#37853)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 8, 2026
### What this PR does / why we need it?
Main2main upgrade vllm to 0330
fix breaks:
1. vllm-project/vllm#37728 adds a clear_row method for BlockTable
2. vllm-project/vllm#37975 adapt to the GatedDeltaNetAttention refactor
3. vllm-project/vllm#37698 update maybe_update_config in vllm_ascend/quantization/modelslim_config.py to adapt to this PR's change
4. vllm-project/vllm#37880 this PR adds the ability to set different moe backends for the draft and target models; we should overwrite it in the draft proposer
5. vllm-project/vllm#37853 for now, just skip the test_cpu_offloading.py test case until this feature has been adapted

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

CI

- vLLM version: v0.18.0
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Claude Code <claude@anthropic.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants